Product Distribution (PD) theory is the information-theoretic extension of conventional
full-rationality game theory to bounded rational games. Here PD theory is used to investigate
games in which the players use bounded rational best-response strategies. This investigation
illuminates how to determine the optimal organization chart for a corporation, or more generally
how to order the sequence of moves of the players / employees so as to optimize an overall
objective function. It is then shown that in the continuum-time limit, bounded rational best
response games result in a variant of the replicator dynamics of evolutionary game theory. This
variant is then investigated for team games, in which the players share the same utility function,
by showing that such continuum-limit bounded rational best response is identical to
Newton-Raphson iterative optimization of the shared utility function. Next PD theory is used to
investigate changing the coordinate system of the game, i.e., changing the mapping from the joint
move of the players to the arguments in the utility functions. Such a change couples those
arguments, essentially by making each player’s move be an offered binding contract.


I. INTRODUCTION 

Recent work has shown that information theory [1-3]
provides a principled extension of conventional
noncooperative game theory to accommodate bounded
rationality [4]. Intuitively, this extension is based on
Occam’s razor: Given only partial knowledge concerning a
game’s (bounded rational) equilibrium, introduce as little
extra information as possible beyond that partial knowledge
in inferring the joint mixed strategy of that equilibrium.
This is formalized by setting the joint mixed strategy of
the game’s equilibrium, $q(x \in X) = \prod_i q_i(x_i)$, to the
minimizer of a set of Lagrangian functions.

The field of Probability Collectives concerns the op- 
timization of distributions over the variables of interest, 
rather than the optimization of those variables directly. 
The special case considered here, where the joint distri- 
bution over the variables of interest is a product distri- 
bution, is known as Product Distribution (PD) theory 
[5-11]. This paper uses PD theory to investigate several 
aspects of game theory not considered in [4]. PD theory 
is applied to games in which the players use bounded ra- 
tional versions of best-response strategies. It is also used 
to investigate changing the coordinate system of a game, 
i.e., changing the mapping from the joint move of the 
players to the arguments in the utility functions. Such 
changes couple those arguments, essentially by making 
the players’ moves be offered binding contracts. This 
paper also uses PD theory to illuminate recent work in 
adaptive distributed control. 

Sec. II reviews how information theory can be used 
to derive bounded rational noncooperative game theory. 
Some simple examples of bounded rational games are 
then presented. 

Sec. III analyzes scenarios in which the players use
bounded rational versions of best response strategies.
Particular attention is paid to team games, in which the
players share the same utility function. The analysis for
this case provides insight into how to optimize the
sequence of moves by the players, as far as their shared
utility is concerned. This can be viewed as a formal way
to optimize the organization chart of a corporation.

Best response strategies, even bounded rational ones, 
are poor models of real-world computational players that 
use Reinforcement Learning (RL) [12-15]. Sec. IV con- 
siders iterated games in which players use a (bounded 
rational) variant of best response, a variant that is more 
realistic for computational players, and arguably for
human players as well. In this variant the conditional
expected utility used by each player to update her
strategy is a decaying average of recent conditional
expected utilities; this implements a bias by the player to
dampen large and sudden changes in her strategy. This
variant is then explored for the case of team games. The
continuum limit of the dynamics of such games is shown to
be a variant of the replicator dynamics. It is shown that
such continuum-limit, bounded rational best response is
identical to Newton-Raphson iterative optimization of the
shared utility function of such games.

The next section investigates changing the coordinate
system of the game. By doing this the moves of the
players get transformed into (bounded rational) contracts
binding them. Some of the implications of such bounded
rational contracts for optimal organization charts are
elucidated, as well as their use to speed convergence in
team games.

This paper ends with a discussion of how these results
relate to other work, and with a brief overview of
extensions of these results.


II. PD THEORY AS BOUNDED RATIONAL 
NONCOOPERATIVE GAME THEORY 

In this section we motivate PD theory as the
information-theoretic formulation of bounded rational
game theory. We use the integral sign ($\int$) with the
associated measure implicit, i.e., it indicates sums if
appropriate, Lebesgue integrals over $\mathbb{R}^n$ if
appropriate, etc. In addition, the subscript $(i)$ is used to
indicate all index values other than $i$. Finally, we use
$\mathcal{P}$ to indicate the set of all probability
distributions over a vector space, and $\mathcal{Q}$ to
indicate the subset of $\mathcal{P}$ consisting of all product
distributions (i.e., the associated Cartesian product of unit
simplices).


A. Review of noncooperative game theory 

In noncooperative game theory one has a set of N
players. Each player i has its own set of allowed
pure strategies. A mixed strategy is a distribution
$q_i(x_i)$ over player i’s possible pure strategies. Each
player i also has a private utility function $g_i$ that
maps the pure strategies adopted by all N of the players
into the real numbers. So given mixed strategies
of all the players, the expected utility of player i is
$E(g_i) = \int dx \prod_j q_j(x_j)\, g_i(x)$ [44].
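
For a finite game the integral above is a sum over joint moves. The following is a minimal sketch of that sum; the function name `expected_utility`, the example utility `g1`, and the numerical strategies are illustrative assumptions, not taken from the paper.

```python
from itertools import product

def expected_utility(g, qs):
    """Return E(g) = sum over joint moves x of prod_j q_j(x_j) * g(x)."""
    total = 0.0
    for x in product(*(range(len(q)) for q in qs)):
        weight = 1.0
        for q_j, x_j in zip(qs, x):
            weight *= q_j[x_j]      # product distribution q(x) = prod_j q_j(x_j)
        total += weight * g(x)
    return total

# Hypothetical 2-player game: player 1 receives 1 when the two moves match.
def g1(x):
    return 1.0 if x[0] == x[1] else 0.0

qs = [[0.5, 0.5], [0.25, 0.75]]     # hypothetical mixed strategies
e1 = expected_utility(g1, qs)
```

The cost of this brute-force sum grows exponentially with the number of players, so it is only meant for small illustrative games.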

In a Nash equilibrium every player adopts the mixed
strategy that maximizes its expected utility, given the
mixed strategies of the other players. More formally,
$\forall i,\ q_i = \arg\max_{q'_i} \int dx\, q'_i(x_i) \prod_{j \neq i} q_j(x_j)\, g_i(x)$.
Perhaps the major objection that has been raised to the
Nash equilibrium concept is its assumption of full
rationality [16-20]. This is the assumption that every
player i can both calculate what the strategies
$q_{j \neq i}$ will be and then calculate its associated
optimal distribution. In other words, it is the assumption
that every player will calculate the entire joint
distribution $q(x) = \prod_j q_j(x_j)$.

In the real world, this assumption of full rationality 
almost never holds, whether the players are humans, ani- 
mals, or computational agents [11, 16, 21-27]. This is due 
to the cost of computation of that optimal distribution, 
if nothing else. This real-world bounded rationality is 
one of the major impediments to applying conventional 
game theory in the real world. 


B. Review of the minimum information principle 

Shannon was the first person to realize that based 
on any of several separate sets of very simple desider- 
ata, there is a unique real-valued quantification of the 
amount of syntactic information in a distribution P(y). 
He showed that this amount of information is the negative
of the Shannon entropy of that distribution,
$S(P) = -\int dy\, P(y) \ln\left[\frac{P(y)}{\mu(y)}\right]$. So for example, the distribution
with minimal information is the one that doesn’t dis- 
tinguish at all between the various y, i.e., the uniform 
distribution. Conversely, the most informative distribu- 
tion is the one that specifies a single possible y. Note 
that for a product distribution, entropy is additive, i.e.,

$S(\prod_i q_i(y_i)) = \sum_i S(q_i)$.
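
The additivity property can be checked numerically. The sketch below assumes finite distributions and the counting measure; the helper name `entropy` and the two example distributions are illustrative choices.

```python
import math
from itertools import product

def entropy(p):
    """Shannon entropy -sum_y p(y) ln p(y), with 0 ln 0 taken as 0."""
    return -sum(v * math.log(v) for v in p if v > 0.0)

q1 = [0.5, 0.25, 0.25]
q2 = [2 / 3, 1 / 4, 1 / 12]

# Joint product distribution over the 9 joint outcomes.
joint = [a * b for a, b in product(q1, q2)]

s_joint = entropy(joint)
s_sum = entropy(q1) + entropy(q2)
s_uniform = entropy([1 / 3] * 3)    # minimal-information distribution on 3 outcomes
```

Here `s_joint` and `s_sum` agree up to floating-point error, and the uniform distribution has larger entropy than `q1`.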


Say we are given some incomplete prior knowledge about a
distribution $P(y)$. How should one estimate $P(y)$ based
on that prior knowledge? Shannon’s result tells us how to
do that in the most conservative way: have your estimate
of $P(y)$ contain the minimal amount of extra information
beyond that already contained in the prior knowledge
about $P(y)$. Intuitively, this can be viewed as a
version of Occam’s razor: introduce as little extra
information as possible beyond what you are provided when
inferring P. This minimum information approach is called
the maxent principle. It has proven extremely powerful in
domains ranging from signal processing to supervised
learning [2]. In particular, it has been successfully used
in many statistics applications, including econometrics [28].
It has even provided what many consider the cleanest
derivation of the foundations of statistical physics [29].


C. Maxent Lagrangians 

Much of the work on equilibrium concepts in game the- 
ory adopts the perspective of an external observer of a 
game. We are told something concerning the game, e.g., 
its cost functions, information sets, etc., and from that 
wish to predict what joint strategy will be followed by 
real-world players of the game. Say that in addition to 
such information, we are told the expected utilities of the 
players. What is our best estimate of the distribution q 
that generated those expected cost values? By the max- 
ent principle, it is the distribution with maximal entropy, 
subject to those expectation values. 

To formalize this, for simplicity assume a finite num- 
ber of players and of possible strategies for each player. 
To agree with the convention in fields other than game 
theory (e.g., optimization, statistical physics, etc.), from 
now on we implicitly flip the sign of each $g_i$ so that the
associated player i wants to minimize that function rather
than maximize it. Intuitively, this flipped $g_i(x)$ is the
“cost” to player i when the joint-strategy is $x$.

With this convention, given prior knowledge that the
expected utilities of the players are given by the set of
values $\{\epsilon_i\}$, the maxent estimate of the associated q is
given by the minimizer of the Lagrangian

$\mathscr{L}(q) \equiv \sum_i \beta_i [E_q(g_i) - \epsilon_i] - S(q)$   (1)

$= \sum_i \beta_i \left[\int dx \prod_j q_j(x_j)\, g_i(x) - \epsilon_i\right] - S(q)$   (2)

where the subscript on the expectation value indicates
that it is evaluated under distribution q. The $\{\beta_i\}$ are
“inverse temperatures” implicitly set by the constraints on
the expected utilities.
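
To illustrate how a constraint on an expected cost implicitly sets an inverse temperature, consider the simplest case of a single player with a fixed cost for each move. The maxent distribution is then a Boltzmann distribution, and $\beta$ can be found numerically. This is a sketch under that single-player assumption; the function names, the cost values, the target, and the bisection bounds are all illustrative.

```python
import math

def boltzmann(costs, beta):
    """Return q(x) proportional to exp(-beta * cost(x))."""
    m = min(costs)                                   # shift for numerical stability
    w = [math.exp(-beta * (c - m)) for c in costs]
    z = sum(w)
    return [v / z for v in w]

def expected_cost(costs, beta):
    q = boltzmann(costs, beta)
    return sum(p * c for p, c in zip(q, costs))

def solve_beta(costs, target, lo=0.0, hi=50.0, iters=200):
    """Bisection for beta in [lo, hi] whose Boltzmann expected cost equals target.
    Relies on the expected cost being non-increasing in beta."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_cost(costs, mid) > target:
            lo = mid          # too irrational: raise beta
        else:
            hi = mid
    return 0.5 * (lo + hi)

costs = [0.0, 1.0, 2.0]       # hypothetical costs of three moves
beta = solve_beta(costs, target=0.5)
```

A target equal to the uniform-average cost (here 1.0) corresponds to $\beta = 0$; targets closer to the minimal cost require larger $\beta$.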

Solving, we find that the mixed strategies minimizing
the Lagrangian are related to each other via

$q_i(x_i) \propto e^{-E_{q_{(i)}}(G \mid x_i)}$   (3)

where the overall proportionality constant for each i is
set by normalization, and $G \equiv \sum_i \beta_i g_i$ [45]. In Eq. 3 the
probability of player i choosing pure strategy $x_i$ depends
on the effect of that choice on the utilities of the other
players. This reflects the fact that our prior knowledge
concerns all the players equally.

If we wish to focus only on the behavior of player i, 
it is appropriate to modify our prior knowledge. To see 
how to do this, first consider the case of maximal prior 
knowledge, in which we know the actual joint-strategy of 
the players, and therefore all of their expected costs. For 
this case, trivially, the maxent principle says we should 
“estimate” q as that joint-strategy (it being the q with 
maximal entropy that is consistent with our prior knowl- 
edge). The same conclusion holds if our prior knowledge 
also includes the expected cost of player i. 

Modify this maximal set of prior knowledge by removing
from it specification of player i’s strategy. So
our prior knowledge is the mixed strategies of all players
other than i, together with player i’s expected cost.
We can incorporate prior knowledge of the other players’
mixed strategies directly, without introducing Lagrange
parameters. The resultant maxent Lagrangian is

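$\mathscr{L}_i(q_i) \equiv \beta_i [E(g_i) - \epsilon_i] - S_i(q_i)$

$= \beta_i \left[\int dx \prod_j q_j(x_j)\, g_i(x) - \epsilon_i\right] - S_i(q_i)$,

solved by a set of coupled Boltzmann distributions:

$q_i(x_i) \propto e^{-\beta_i E_{q_{(i)}}(g_i \mid x_i)}$.   (4)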

Following Nash, we can use Brouwer’s fixed point theorem
to establish that for any non-negative values $\{\beta_i\}$,
there must exist at least one product distribution given
by the product of these Boltzmann distributions (one
term in the product for each i).

The first term in $\mathscr{L}_i$ is minimized by a perfectly
rational player. The second term is minimized by a
perfectly irrational player, i.e., by a perfectly uniform mixed
strategy $q_i$. So $\beta_i$ in the maxent Lagrangian explicitly
specifies the balance between the rational and irrational
behavior of the player. In particular, for $\beta \to \infty$, by
minimizing the Lagrangians we recover the Nash equilibria
of the game. More formally, in that limit the set of q
that simultaneously minimize the Lagrangians is the set
of mixed strategy equilibria of the game, together with
the set of delta functions about the pure Nash equilibria
of the game. The same is true for Eq. 3.
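
The Boltzmann form of Eq. 4 and its two limits can be illustrated for a 2-player game with finitely many moves. This is a minimal sketch; the function names, the cost matrix `g1`, and the opponent strategy `q2` are hypothetical, and players are indexed 0 and 1 in the code.

```python
import math

def conditional_expected_cost(g_i, q_other, player):
    """E(g_i | x_i) in a 2-player game.
    g_i[a][b] is the cost when the first player plays a and the second plays b."""
    if player == 0:
        return [sum(q_other[b] * g_i[a][b] for b in range(len(q_other)))
                for a in range(len(g_i))]
    return [sum(q_other[a] * g_i[a][b] for a in range(len(q_other)))
            for b in range(len(g_i[0]))]

def boltzmann_response(g_i, q_other, player, beta):
    """q_i(x_i) proportional to exp(-beta * E(g_i | x_i)), the form of Eq. 4."""
    e = conditional_expected_cost(g_i, q_other, player)
    m = min(e)                                       # shift for numerical stability
    w = [math.exp(-beta * (v - m)) for v in e]
    z = sum(w)
    return [v / z for v in w]

g1 = [[0.0, 1.0], [1.0, 0.0]]    # hypothetical cost matrix for the first player
q2 = [0.9, 0.1]                  # hypothetical strategy of the second player

irrational = boltzmann_response(g1, q2, player=0, beta=0.0)    # uniform
bounded = boltzmann_response(g1, q2, player=0, beta=1.0)
rational = boltzmann_response(g1, q2, player=0, beta=50.0)     # near best response
```

With `beta=0.0` the response is uniform; as `beta` grows it concentrates on the move with the lowest conditional expected cost.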

Note also that independent of information-theoretic 
considerations, the Boltzmann distribution is a reason- 
able (highly abstracted) model of how human players will 
behave. Typically humans do some “exploration” as well
as “exploitation”, trying out all moves, with decreasing
frequency as the expected cost of the move increases. This
is captured in the Boltzmann distribution mixed strategy.

One can formalize the concept of the rationality of a
player in a way that applies to any distribution, not just a
Boltzmann distribution. One does this with a rationality
operator which maps a $q$ and $g_i$ to a non-negative
real value measuring the rationality of player i in adopting
strategy $q_i$ given private cost function $g_i$ and
strategies of the other players. For the solution in Eq. 4
and private cost $g_i$, the value of that operator is just
$\beta_i$ [4].

Eq. 3 is just a special case of Eq. 4, where all players
share the same private cost function, G. (Such games are
known as team games.) This relationship reflects the
fact that for this case, the difference between the maxent
Lagrangian and the one in Eq. 2 is independent of $q_i$.
Due to this relationship, our guarantee of the existence
of a solution to the set of maxent Lagrangians implies
the existence of a solution of the form Eq. 3. Typically
players will be closer to minimizing their expected cost
than maximizing it. For prior knowledge consistent with
such a case, the $\beta_i$ are all non-negative.

For each player i define

$f_i(x, q_i(x_i)) \equiv \beta_i g_i(x) + \ln[q_i(x_i)]$.   (5)

Then we can write the maxent Lagrangian for player i as

$\mathscr{L}_i(q) = \int dx\, q(x)\, f_i(x, q_i(x_i))$.   (6)

Now in a bounded rational game every player sets its 
strategy to minimize its Lagrangian, given the strategies 
of the other players. In light of Eq. 6, this means that we 
can interpret each player in a bounded rational game as 
being perfectly rational for a cost function that incorpo- 
rates its computational cost. To do so we simply need to 
expand the domain of “cost functions” to include (loga- 
rithms of) probability values as well as joint moves. 
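
The reading of the Lagrangian as an ordinary expectation of an augmented cost can be checked numerically. The sketch below assumes a 2-player game with strictly positive mixed strategies and drops the constant term involving the constrained expected cost; the function names and numbers are illustrative.

```python
import math
from itertools import product

def lagrangian_via_f(g_i, q1, q2, beta, player):
    """sum_x q(x) * [beta * g_i(x) + ln q_i(x_i)] for a product q = q1 x q2."""
    qi = q1 if player == 0 else q2
    total = 0.0
    for a, b in product(range(len(q1)), range(len(q2))):
        xi = a if player == 0 else b
        total += q1[a] * q2[b] * (beta * g_i[a][b] + math.log(qi[xi]))
    return total

def lagrangian_direct(g_i, q1, q2, beta, player):
    """beta * E(g_i) - S(q_i), i.e. expected cost traded off against entropy."""
    qi = q1 if player == 0 else q2
    e = sum(q1[a] * q2[b] * g_i[a][b]
            for a, b in product(range(len(q1)), range(len(q2))))
    s = -sum(p * math.log(p) for p in qi)
    return beta * e - s

g = [[0.0, 1.0], [1.0, 0.5]]         # hypothetical cost matrix
q1, q2 = [0.3, 0.7], [0.6, 0.4]      # hypothetical strictly positive strategies
via_f = lagrangian_via_f(g, q1, q2, beta=2.0, player=0)
direct = lagrangian_direct(g, q1, q2, beta=2.0, player=0)
```

The two values coincide because the expectation of $\ln[q_i(x_i)]$ under $q$ is minus the entropy of $q_i$.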

D. Examples of bounded rational equilibria 

It can be difficult to start with a set of cost functions
and associated rationalities $\beta_i$ and then solve for the
associated bounded rational equilibrium q. Solving for q
when prior knowledge consists of expected costs $\epsilon_i$ rather
than rationalities can be even more tedious. (In that
situation the $\beta_i$ are not specified upfront but instead are
Lagrange parameters that we must solve for.) However
there is an alternative approach to constructing exam- 
ples of games and their bounded rational equilibria that 
is quite simple. In this alternative one starts with a par- 
ticular mixed strategy q and then solves for a game for 
which q is a bounded rational equilibrium, rather than 
the other way around. 

To illustrate this, consider a 2-person noncooperative
single-stage game. Let each player have 3 possible moves.
Indicate each player’s three possible moves by the
numerals 0, 1, and 2. Say the (bounded rational) mixed
strategy equilibrium is

$q_1(0) = 1/2,\ q_1(1) = 1/4,\ q_1(2) = 1/4$;

$q_2(0) = 2/3,\ q_2(1) = 1/4,\ q_2(2) = 1/12$.   (7)

Now we know that at the equilibrium,
$q_1(x_1) \propto e^{-\beta_1 E(g_1 \mid x_1)}$, where $\beta_1$ is player 1’s
rationality, and $g_1$ is her cost function (the negative of
her utility function). This means for example that

$e^{-\beta_1 [E(g_1 \mid x_1 = 0) - E(g_1 \mid x_1 = 1)]} = \frac{q_1(0)}{q_1(1)} = 2$,

i.e.,

$\beta_1 [E(g_1 \mid x_1 = 0) - E(g_1 \mid x_1 = 1)] = -\ln(2)$.   (8)

We have a similar equation for the remaining independent
difference in expectation values for player 1. The
analogous pair of equations for player 2 also holds.

Now define the vectors $g_{i;j}(\cdot) \equiv g_i(x_i = j, \cdot)$. So for
example $g_{1;0} = (g_1(x_1 = 0, x_2 = 0),\ g_1(x_1 = 0, x_2 = 1),\ g_1(x_1 = 0, x_2 = 2))$.
Then we can express our equations compactly as four dot
product equalities:

$\beta_1 (g_{1;0} - g_{1;1}) \cdot q_2 = -\ln(2)$,

$\beta_1 (g_{1;0} - g_{1;2}) \cdot q_2 = -\ln(2)$;

$\beta_2 (g_{2;0} - g_{2;1}) \cdot q_1 = -\ln(8/3)$,

$\beta_2 (g_{2;0} - g_{2;2}) \cdot q_1 = -\ln(8)$.   (9)

Note that we can absorb each $\beta_i$ into its associated $g_i$;
all that matters is their product.

We can now plug in for the vectors $q_1$ and $q_2$ from
Eq. 7 and simply write down a set of solutions for the
four three-dimensional vectors $g_{i;j}$. For these $\{g_i\}$ the
bounded rational equilibrium is given by the q of Eq. 7.
If desired, we can evaluate the associated expected values
of the cost functions for the two players; our q is the
bounded rational equilibrium for those expected costs.

Note that the variables in the first pair of equalities 
in Eq. 9 are independent of those in the second pair. In 
other words, whereas the Boltzmann equations giving q 
for a specified set of gi are a set of coupled equations, the 
equations giving the $g_i$ for a specified q are not coupled.
Note also that our equations for the $g_{i;j}$ are (extremely)
underconstrained. This illustrates how compressive the
mapping from the $g_i$ to the associated equilibrium q is.
Bear in mind though that that mapping is also
multi-valued in general; in general a single set of cost functions
can have more than one equilibrium, just like it can have
more than one Nash equilibrium. 
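
Because the equations are so underconstrained, one solution can be written down by inspection. The sketch below makes one arbitrary choice among many: the rationalities are absorbed into the costs, the cost vector for move 0 is set to zero, and the cost vector for each other move is constant across the opponent's moves. The function names are illustrative. It then checks that the Boltzmann responses reproduce the strategies of Eq. 7.

```python
import math

q1 = [1 / 2, 1 / 4, 1 / 4]
q2 = [2 / 3, 1 / 4, 1 / 12]

def build_costs(q_self, n_other):
    """Rows g_{i;j}: costs of own move j against each opponent move.
    Row 0 is zero; row j is the constant ln(q_self[0] / q_self[j])."""
    return [[math.log(q_self[0] / q_self[j])] * n_other
            for j in range(len(q_self))]

def boltzmann_from_rows(rows, q_other):
    """q_i(j) proportional to exp(-g_{i;j} . q_other), with beta absorbed."""
    e = [sum(c * p for c, p in zip(row, q_other)) for row in rows]
    w = [math.exp(-v) for v in e]
    z = sum(w)
    return [v / z for v in w]

g1_rows = build_costs(q1, len(q2))
g2_rows = build_costs(q2, len(q1))
r1 = boltzmann_from_rows(g1_rows, q2)    # should reproduce q1
r2 = boltzmann_from_rows(g2_rows, q1)    # should reproduce q2

# Left-hand side of the first equality in Eq. 9 (with beta_1 absorbed).
lhs = sum((a - b) * p for a, b, p in zip(g1_rows[0], g1_rows[1], q2))
```

Any other choice of cost vectors with the same projections onto the opponent's strategy gives the same equilibrium, which is the sense in which the mapping from costs to equilibria is compressive.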

The generalization of this example to arbitrary num- 
bers of players with arbitrary move spaces is immediate. 
As before, indicate the moves of every player by an
associated set of integer numerals starting at 0. Recall that
the subscript $(i)$ on a vector indicates all components but
the i’th one. Also absorb the rationalities $\beta_i$ into the
associated $g_i$.

Now specify q and the vectors $g_i(x_i = 0, \cdot)$ (one vector
for each i) to be anything whatsoever. Then for
all players i, the only associated constraint on the i’th
cost function concerns certain projections of the vectors
$g_i(x_i > 0, \cdot)$ (one projection for each value $x_i > 0$).
Concretely, $\forall i,\ x_i > 0$,

$\int dx'_{(i)}\, g_i(x_i, x'_{(i)}) \prod_{j \neq i} q_j(x'_j) = -\ln\left(\frac{q_i(x_i)}{q_i(0)}\right) + \int dx'_{(i)}\, g_i(0, x'_{(i)}) \prod_{j \neq i} q_j(x'_j)$,   (10)

i.e., $\forall i,\ x_i > 0$,

$g_i(x_i, \cdot) \cdot q_{(i)} = -\ln\left(\frac{q_i(x_i)}{q_i(0)}\right) + g_i(0, \cdot) \cdot q_{(i)}$.   (11)

All the terms on the right-hand side are specified, as well
as the $q_{(i)}$ term on the left-hand side. Any $g_i(x_i, \cdot)$ that
obeys the associated equation has the specified q as a
bounded rational equilibrium.

E. Discussion 

There are numerous alternative interpretations of the 
information-theoretic formulation of bounded rationality 
presented here. For example, change our prior knowledge 
to be the entropy of each player i’s strategy, i.e., how 
unsure it is of what move to make. Now we cannot use 
information theory to make our estimate of q. Given 
that players try to minimize expected cost, a reasonable 
alternative is to predict that each player i’s expected cost
will be as small as possible, subject to that provided value
of the entropy and the other players’ strategies. The
associated Lagrangians are $\alpha_i [S(q_i) - \sigma_i] - E(g_i)$, where
$\sigma_i$ is the provided entropy value. This is equivalent to
the maxent Lagrangian, and in particular has the same 
solution, Eq. 4. 

Another alternative interpretation involves world
cost functions, which are quantifications of the quality
of a joint pure strategy x from the point of view of an
external observer (e.g., a system designer, the government,
an auctioneer, etc.). A particular class of world cost
functions are (negatives of) “social welfare functions”,
which can be expressed in terms of the cost functions of
the individual players. Perhaps the simplest example is
$G(x) = \sum_i \beta_i g_i(x)$, where the $\beta_i$ serve to trade off how
much we value one player’s cost vs. another’s. If we know
the value of this social welfare function, but nothing else,
then maxent tells us to minimize the Lagrangian of Eq. 2.

Often our prior knowledge will not consist of exact 
specification of the expected costs of the players, even 
if that knowledge arises from watching the players make 
their moves. Such alternative kinds of prior knowledge 
are addressed in [4, 6]. In particular, in those
references it is shown how one might define a “rationality
operator” that quantifies the rationality of any pair of
a player’s mixed strategy and cost function, given the
mixed strategy of all the other players. If one’s prior
knowledge is the values of the rationalities of the
players, then one again arrives at solutions of the form in
Eq. 4, where the value of $\beta_i$ reflects the rationality of that
player.

In addition, in the real world the information we are 
provided concerning the system often will not consist of 
exact values of functionals of q , be those values expected 
costs, rationalities, or what have you. Rather that knowl- 
edge will be in the form of data, D, together with an 
associated likelihood function over the space of q. For 
example, that knowledge might consist of a bias toward 
particular rationality values, rather than precisely speci- 
fied values: 

P(D | q ) oc 

where a sets the strength of the bias. 

As mentioned in the introduction, these results can
also be extended in many ways (e.g., to allow multiple
cost functions, variable numbers of players, etc.). Some
such extensions are explored below.


III. BOUNDED RATIONAL VERSIONS OF 
BEST RESPONSE 

One crude way to try to find the q given by Eq. 4 would 
be an iterative process akin to the best-response scheme 
of game theory [16]. Given any current distribution q,
in this scheme all agents i simultaneously replace their
current distributions. In this replacement each agent i
replaces $q_i$ with the distribution given in Eq. 4 based on
the current $q_{(i)}$. This scheme is the basis of the use of
Brouwer’s fixed point theorem to prove that a solution to 
Eq. 4 exists. Accordingly, it is called parallel Brouwer 
updating. (This scheme goes by many names in the lit- 
erature, from Boltzmann learning in the RL community 
to block relaxation in the optimization community.) 
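
Parallel Brouwer updating can be sketched for a 2-player team game as follows. The shared cost matrix `G`, the value of `beta`, and the function names are illustrative assumptions. For this particular small example the update map happens to be a contraction, so the iteration converges; that is not guaranteed in general.

```python
import math

def softmin(costs, beta):
    """Boltzmann distribution proportional to exp(-beta * cost)."""
    m = min(costs)
    w = [math.exp(-beta * (c - m)) for c in costs]
    z = sum(w)
    return [v / z for v in w]

def parallel_brouwer_step(G, q1, q2, beta):
    """Both players simultaneously adopt the Boltzmann response (form of Eq. 4)
    to the other player's current strategy."""
    e1 = [sum(q2[b] * G[a][b] for b in range(len(q2))) for a in range(len(q1))]
    e2 = [sum(q1[a] * G[a][b] for a in range(len(q1))) for b in range(len(q2))]
    return softmin(e1, beta), softmin(e2, beta)

G = [[0.0, 1.0], [1.0, 0.5]]      # hypothetical shared cost matrix
q1, q2 = [0.5, 0.5], [0.5, 0.5]
for _ in range(200):
    q1, q2 = parallel_brouwer_step(G, q1, q2, beta=1.0)

# At a fixed point, one more update leaves the strategies unchanged.
n1, n2 = parallel_brouwer_step(G, q1, q2, beta=1.0)
residual = max(abs(a - b) for a, b in zip(q1 + q2, n1 + n2))
```

A vanishing `residual` indicates that the final pair of strategies satisfies the coupled Boltzmann equations for this game.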

Sometimes the conditional expected cost for each agent 
can be calculated explicitly at each iteration. More gen- 
erally, it must be estimated. This can be done via Monte- 
Carlo sampling, iterated across a block of time through- 
out which q is unchanging. During that block the agents 
all repeatedly and jointly IID sample their probability 
distributions to generate joint moves, and the associated 
cost values are recorded. These are then used to estimate
all the conditional expected costs, which are then used
to determine the parallel Brouwer update [46].
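
The Monte Carlo block described above can be sketched as follows for one player of a 2-player game. The function name, the cost matrix, the strategies, the sample size, and the seed are all illustrative assumptions.

```python
import random

def mc_conditional_costs(g_i, q1, q2, n_samples, rng):
    """Estimate E(g_i | x_1) for each move x_1 of the first player by IID
    sampling joint moves from the fixed product distribution q1 x q2."""
    sums = [0.0] * len(q1)
    counts = [0] * len(q1)
    moves1, moves2 = range(len(q1)), range(len(q2))
    for _ in range(n_samples):
        a = rng.choices(moves1, weights=q1)[0]
        b = rng.choices(moves2, weights=q2)[0]
        sums[a] += g_i[a][b]          # record the cost under the move taken
        counts[a] += 1
    # Moves that were never sampled have no estimate.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]

g = [[0.0, 1.0], [1.0, 0.5]]          # hypothetical cost matrix
q1, q2 = [0.5, 0.5], [0.9, 0.1]
est = mc_conditional_costs(g, q1, q2, n_samples=20000, rng=random.Random(0))
exact = [sum(q2[b] * g[a][b] for b in range(2)) for a in range(2)]
```

The estimates in `est` approach `exact` as the block length grows; moves with low probability under `q1` are sampled rarely and therefore have noisier estimates.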

This is exactly what is done in RL-based schemes in 
which each agent maintains a data-based estimate of its 
cost for each of its possible moves, and then chooses its 
actual move stochastically, by sampling a Boltzmann
distribution of those estimates. (See [5] for ways to get
accurate MC estimates more efficiently than in this simple
scheme, e.g., by exploiting the bias-variance tradeoff of
statistics.)

One alternative to parallel Brouwer updating is serial
Brouwer updating, where we only update one $q_i$ at a
time. This is analogous to a Stackelberg game, in that
one agent makes its move and then the other(s) respond
[17, 19]. In a team game, any serial Brouwer updating
must reduce the common Lagrangian, in contrast to the
case with parallel Brouwer updating.
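
The monotone behavior of serial updating in a team game can be checked numerically. The sketch below assumes a 2-player team game whose common Lagrangian is $\beta E_q(G) - S(q_1) - S(q_2)$; the cost matrix, `beta`, and the initial strategies are illustrative.

```python
import math

def softmin(costs, beta):
    m = min(costs)
    w = [math.exp(-beta * (c - m)) for c in costs]
    z = sum(w)
    return [v / z for v in w]

def common_lagrangian(G, q1, q2, beta):
    """beta * E_q(G) - S(q1) - S(q2) for a 2-player team game."""
    e = sum(q1[a] * q2[b] * G[a][b]
            for a in range(len(q1)) for b in range(len(q2)))
    s = -sum(p * math.log(p) for q in (q1, q2) for p in q)
    return beta * e - s

G = [[0.0, 1.0], [1.0, 0.5]]      # hypothetical shared cost matrix
beta = 5.0
q1, q2 = [0.9, 0.1], [0.2, 0.8]
history = [common_lagrangian(G, q1, q2, beta)]
for sweep in range(10):
    # The first player responds to the current q2 ...
    q1 = softmin([sum(q2[b] * G[a][b] for b in range(2)) for a in range(2)], beta)
    history.append(common_lagrangian(G, q1, q2, beta))
    # ... then the second player responds to the updated q1.
    q2 = softmin([sum(q1[a] * G[a][b] for a in range(2)) for b in range(2)], beta)
    history.append(common_lagrangian(G, q1, q2, beta))
```

Each entry of `history` is no larger than the one before it, since each single-player update minimizes the common Lagrangian over that player's strategy with the other strategy held fixed.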

There are many versions of serial updating. In cyclic 
serial Brouwer updating, one cycles through the i in or- 
der. In random serial Brouwer updating, one cycles 
through them in a random fashion. 

In greedy serial Brouwer updating, instead of cycling
through all i, at each iteration we choose which
player to update based on how much that will reduce
the common Lagrangian. Those reductions can
be evaluated without explicitly calculating the associated
Boltzmann distributions. To see how, use $N_i$ to
indicate the normalization constant of Eq. 4. Then define
the Lagrangian gap at q for player i as
$\ln[N_i] + \int dx_i\, q_i(x_i)\, E_{q_{(i)}}(g_i \mid x_i) + \int dx_i\, q_i(x_i) \ln[q_i(x_i)]$.
This is how much $\mathscr{L}$ is reduced if only $q_i$ undergoes
the Brouwer update [47].
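
The Lagrangian gap can be compared against the actual reduction produced by a Brouwer update. The sketch below assumes the rationality $\beta_i$ has been absorbed into the costs, so that the normalization constant is $N_i = \sum_{x_i} e^{-E(g_i \mid x_i)}$; the function names and numbers are illustrative.

```python
import math

def lagrangian_gap(cond_costs, q_i):
    """ln N_i + sum q_i E(g_i|x_i) + sum q_i ln q_i, with beta_i absorbed
    into the conditional expected costs."""
    log_n = math.log(sum(math.exp(-c) for c in cond_costs))
    avg = sum(p * c for p, c in zip(q_i, cond_costs))
    neg_entropy = sum(p * math.log(p) for p in q_i if p > 0.0)
    return log_n + avg + neg_entropy

def player_lagrangian(cond_costs, q_i):
    """E(g_i) - S(q_i) for fixed strategies of the other players."""
    return sum(p * (c + math.log(p)) for p, c in zip(q_i, cond_costs) if p > 0.0)

cond_costs = [0.3, 1.2, 0.7]      # hypothetical values of E(g_i | x_i)
q_old = [0.2, 0.5, 0.3]           # hypothetical current strategy of player i

# Brouwer update of q_i: the Boltzmann distribution over the same costs.
z = sum(math.exp(-c) for c in cond_costs)
q_new = [math.exp(-c) / z for c in cond_costs]

gap = lagrangian_gap(cond_costs, q_old)
drop = player_lagrangian(cond_costs, q_old) - player_lagrangian(cond_costs, q_new)
```

Here `gap` equals `drop`, and the gap evaluated at `q_new` itself is zero. A greedy scheme would compute the gap for every player and update the one with the largest value.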

Another obvious variant of these schemes is mixed se- 
rial/parallel Brouwer updating, in which one subset of 
the players moves in synchrony, followed by another sub- 
set, and so on. Such updating in a team game can be 
viewed as a simple model of the organization chart of the 
players. For example, this is the case when the players 
are a corporation, with G being a common cost function 
based on the corporation’s performance. 

Say we observe the functioning of such an organization 
over time, and view those observations as Monte Carlo 
sampling of its behavior. Then we use those samples 
to statistically estimate how best to do serial/parallel 
Brouwer updating, for the purpose of minimizing the 
shared cost function G. This can be viewed as a way 
to optimize the organization chart coupling the players. 


IV. PARALLEL BROUWER WITH 
DATA-AGING IS NEAREST NEWTON 

This section considers a variant of best-response that 
is more realistic (more accurately modeling RL-based 
computational players that are actually used in machine 
learning, and arguably more accurately modeling human 
players as well). In this variant the expected cost used 
by each player to update her strategy is a decaying av- 
erage of recent expected utilities; this decay reflects a 
conservative preference for dampening large changes in 
strategy. 

Such a bias is used (implicitly or otherwise) in most 
multi-player RL algorithms. For example, in the COIN 
framework each agent i collects a data set of pairs of what 
value its private cost function has at timestep t together 
with the move it made then. It then estimates its cost
for move $x_i$ as a weighted average of all the cost values
in its data set for that move. The weights are
exponentially decaying functions of how long ago the associated
observation was made. This data-aging is crucial to
reflect the non-stationarity of agent i’s environment, i.e.,
the fact that the other agents are changing their
strategies with time. Arguably similar modifications to best
response are used by human players. Indeed, in
idealized learning rules like fictitious play, such dampening is
crucial.


A. The dynamics of Brouwer updating 

Consider a multi-stage game where at the end of
iteration t, each player i updates her distribution $q_i(\cdot, t)$
to

$q_i(x_i, t) = \frac{e^{-\Phi_i(x_i, t)}}{\int dx'_i\, e^{-\Phi_i(x'_i, t)}}$.   (12)

This is a generalization of parallel Brouwer updating,
where the function being exponentiated can be Q values
(as in Q-learning [30]), single-instant reward values,
distorted versions of these (e.g., to incorporate data-aging),
etc.

As an example, for single-instant rewards (i.e.,
conventional parallel Brouwer), $\Phi_i(x_i, t)$ is player i’s estimate of
($\beta_i$ times) her conditional expected cost for taking move
$x_i$ at time $t - 1$. If that estimate were exact, this would
mean

$\Phi_i(x_i, t) = \beta E(g_i \mid x_i)$

$= \beta \int dx_{(i)}\, q_{(i)}(x_{(i)}, t - 1)\, g_i(x_i, x_{(i)})$.   (13)

As another example, for Q-learning, one player is Nature
and her distribution is always a delta function. In this
case $\Phi_i(x_i, t)$ is the Q-value for player i taking action $x_i$,
when the state of Nature is as specified by the associated
delta function in $q(\cdot, t - 1)$.

Note that there’s no Monte Carlo sampling being done
here, as there is in most real-world RL; this is a somewhat
abstracted version of such RL. Alternatively, the
analysis here becomes exact when $\Phi_i$ is evaluated in closed form,
or (as when $\Phi_i$ is an empirical expectation value) there
are enough samples in a Monte Carlo block so that
empirical averages effectively give us exact values of expected
quantities.

At this point we have to say something about how $\Phi_i$
evolves with time. Consider the case where $\Phi_i$ is an
estimate of some function $\phi_i$, formed by exponential aging
of the previous $\phi$ values. In our case (since everything
is evaluated in closed form) assuming there have been an
infinite number of preceding timesteps, this is the same
as geometric data-aging:

$\Phi_i(x_i, t) = \alpha\, \phi_i(x_i, q(t - 1)) + (1 - \alpha)\, \Phi_i(x_i, t - 1)$   (14)

for some appropriate function $\phi_i$ [48]. For example, in
parallel Brouwer updating, $\phi_i(x_i, t) = \beta E(g_i \mid x_i, q_{(i)}(t))$,
while $\Phi_i(x_i, t)$ is a geometric average of the previous
values of $\phi(x_i)$.


B. The continuum-time limit 

To go to the continuum-time limit, let t be a real
variable, and replace the temporal delay value of 1 in Eq. 14
with $\delta$ and $\alpha$ with $\alpha\delta$ (we’ll eventually take $\delta \to 0$). In
addition differentiate Eq. 12 with respect to t to get

$\frac{dq_i(x_i, t)}{dt} = -q_i(x_i, t) \left[ \frac{d\Phi_i(x_i, t)}{dt} - \int dx'_i\, q_i(x'_i, t)\, \frac{d\Phi_i(x'_i, t)}{dt} \right]$.   (15)

In addition, in the $\delta \to 0$ limit, assuming q is a continuous
function of t, Eq. 14 becomes

$\frac{d\Phi_i(x_i, q)}{dt} = \alpha [\phi_i(x_i, q) - \Phi_i(x_i, q)]$,   (16)

where from now on the t variable is being suppressed for
clarity.

If we knew the dynamics of $\phi_i$, we could solve Eq. 16
via integrating factors, in the usual way. Instead, here
we’ll plug that equation for $d\Phi_i/dt$ into Eq. 15. Then use
Eq. 12 to write $\Phi_i(x_i, q) = \text{constant} - \ln(q_i(x_i))$. The
result is

$\frac{dq_i(x_i)}{dt} = \alpha\, q_i(x_i) [\phi_i(x_i, q) + \ln(q_i(x_i))]$

$- \alpha \int dx'_i\, q_i(x'_i) [\phi_i(x'_i) + \ln(q_i(x'_i))]$.   (17)
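
The discrete-time update defined by Eqs. 12 and 14 can be sketched for a 2-player team game as follows. The shared cost matrix `G`, the values of `beta` and `alpha`, the zero initialization of the aged estimates, and the function names are illustrative assumptions. Setting `alpha=1.0` recovers conventional parallel Brouwer updating.

```python
import math

def softmin(phi):
    """q(x) proportional to exp(-phi(x)), the form of Eq. 12."""
    m = min(phi)
    w = [math.exp(-(v - m)) for v in phi]
    z = sum(w)
    return [v / z for v in w]

def aged_brouwer(G, beta, alpha, steps):
    """Parallel Brouwer updating with geometric data-aging (form of Eq. 14)
    for a 2-player team game with shared cost matrix G."""
    n1, n2 = len(G), len(G[0])
    Phi1, Phi2 = [0.0] * n1, [0.0] * n2       # aged estimates, start uniform
    q1, q2 = softmin(Phi1), softmin(Phi2)
    for _ in range(steps):
        # phi_i: beta times the conditional expected cost under the previous q.
        phi1 = [beta * sum(q2[b] * G[a][b] for b in range(n2)) for a in range(n1)]
        phi2 = [beta * sum(q1[a] * G[a][b] for a in range(n1)) for b in range(n2)]
        # Geometric aging: blend the new value with the previous estimate.
        Phi1 = [alpha * p + (1 - alpha) * P for p, P in zip(phi1, Phi1)]
        Phi2 = [alpha * p + (1 - alpha) * P for p, P in zip(phi2, Phi2)]
        q1, q2 = softmin(Phi1), softmin(Phi2)
    return q1, q2

G = [[0.0, 1.0], [1.0, 0.5]]                  # hypothetical shared cost matrix
q1, q2 = aged_brouwer(G, beta=1.0, alpha=0.2, steps=500)
```

Smaller values of `alpha` make each strategy change more slowly, which is the dampening effect that data-aging is meant to provide.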


C. Relation with Nearest Newton descent and 
replicator dynamics 

As mentioned previously, there are many ways to find
equilibria, and in particular many distributed algorithms
for doing so. This is especially so in team games, where
finding such equilibria reduces to descending a single
over-arching Lagrangian.

One natural idea for descent in such games is to use
the Newton-Raphson descent algorithm. However that
algorithm cannot be applied directly to search across q
in a distributed fashion, due to the need to invert
matrices coupling the agents. As an alternative, one can
consider what new distribution p the Newton algorithm
would step to if there were no restriction that p be a
product distribution. One can then ask what product
distribution is closest to p, according to Kullback-Leibler
distance [1]. It turns out that one can solve for that
optimal product distribution. The associated update rule
is called the Nearest Newton algorithm [31].
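
The projection step can be illustrated in isolation. The sketch below assumes the distance is the Kullback-Leibler divergence taken from the joint distribution p to the product distribution q, i.e., $KL(p \| q)$; the text above does not state the direction, so this is an assumption. Under it, the closest product distribution is the product of the marginals of p, which the code checks against a grid search. The joint distribution `p` is a hypothetical example, not a Newton step.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence sum_x p(x) ln(p(x) / q(x))."""
    return sum(pv * math.log(pv / qv) for pv, qv in zip(p, q) if pv > 0.0)

def product_joint(q1, q2):
    """Row-major joint distribution of the product q1 x q2."""
    return [a * b for a in q1 for b in q2]

def marginals(p, n1, n2):
    """Marginals of a row-major joint distribution over n1 x n2 moves."""
    m1 = [sum(p[a * n2 + b] for b in range(n2)) for a in range(n1)]
    m2 = [sum(p[a * n2 + b] for a in range(n1)) for b in range(n2)]
    return m1, m2

p = [0.4, 0.1, 0.2, 0.3]          # hypothetical non-product joint over 2 x 2 moves
m1, m2 = marginals(p, 2, 2)
best = kl(p, product_joint(m1, m2))

# Brute-force comparison over a grid of 2 x 2 product distributions.
grid = [k / 50 for k in range(1, 50)]
grid_min = min(kl(p, product_joint([u, 1 - u], [v, 1 - v]))
               for u in grid for v in grid)
```

No grid point achieves a smaller divergence than the product of marginals, consistent with the closed-form solution for this direction of the divergence.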

It turns out that when one writes down the Nearest
Newton update rule, it says to replace each component
$q_i(x_i)$ with the exact quantity appearing on the
right-hand side of Eq. 17, where $\alpha$ is the stepsize of the update,
and $\phi_i(x_i, t) = \beta E(G \mid x_i, q_{(i)}(t))$, as in parallel Brouwer
updating for a team game [49]. In other words, in team
games, the continuum limit of having each player using
(bounded rational) best response is identical to the
continuum limit of the Newton-Raphson algorithm for
descending the Lagrangian, with the data-aging parameter
$\alpha$ giving the stepsize.

Eq. 17 arises in yet other contexts as well. In
particular, say $\phi_i$ is conditional expected rewards (i.e.,
$\phi_i(x_i, t-1) = E(g_i \mid q(\cdot, t-1))$). Then the $\beta \to \infty$ limit
of Eq. 17 reduces to a simplified form of the replicator
dynamics equation of evolutionary game theory [32, 33].
(If the stepsize $\alpha$ is an appropriately increasing function
of $E(G)$, other versions of that dynamics arise.) This is
because in that limit the $\ln$ term disappears, and the
right-hand side of Eq. 17 involves only the difference
between player $i$'s expected cost and the average expected
cost of all players. This 3-way connection suggests using
some of the techniques for solving replicator dynamics to
expedite either parallel Brouwer or Nearest Newton.


D. Convergence and equilibria 


By Eq. 17, at equilibrium, for each $i$,
$q_i(x_i)[\phi_i(x_i, q) + \ln(q_i(x_i))]$ must be independent of $i$. One way this
can occur is if it equals 0. However $q_i(x_i)$ can never
be 0, by Eq. 12. This means we have an equilibrium
at $q_i(x_i) \propto e^{-\phi_i(x_i, q)}$. Intuitively, this is exactly what
we want, according to Eq. 12 and our interpretation of
$\Phi_i(x_i, q)$ as an estimate of $\phi_i(x_i, q)$. Note also that this
solution means that $\phi_i(x_i, q) = \Phi_i(x_i, q)$, so that (according
to Eq. 16) $\Phi_i(x_i, q)$ has also reached an equilibrium.

When our equilibrium has
$q_i(x_i)[\phi_i(x_i, q) + \ln(q_i(x_i))] = A \neq 0$, we have

$$q_i(x_i) \propto e^{A/q_i(x_i) - \phi_i(x_i, q)} \tag{18}$$


In light of Eq. 12, this means that $\Phi_i(x_i, q) \neq \phi_i(x_i, q)$.
So by Eq. 16, $\Phi_i(x_i, q)$ hasn't reached an equilibrium in
this case:

$$\frac{d\Phi_i(x_i, q)}{dt} = \alpha\,\phi_i(x_i, q)[1 - q_i(x_i)]. \tag{19}$$


If both $q_i(x_i)$ and $\phi_i(x_i, q)$ were frozen at this point,
this solution for $\Phi_i(x_i, q)$ would not obey Eq. 14. So
$q_i(x_i)$ and/or $\phi_i(x_i, q)$ cannot be frozen. In fact, if
$\phi_i(x_i, q)$ varies with time, then we know by Eq. 17 that
$q_i(x_i)$ varies as well. So in either case $q_i(x_i)$ must vary,
i.e., this equilibrium is not stable.

Although the dynamics has the desired fixed point, it 
may take a long time to converge there. There are sev- 
eral ways to analyze that: One is to examine the second 
derivatives (with respect to time) of the $q_i$ and/or the
$\Phi_i$. Another is to examine the time-dependence of the
residual error,

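$$r_i(x_i, t) \equiv \frac{e^{-\Phi_i(x_i, t)}}{\int dx_i'\, e^{-\Phi_i(x_i', t)}} - \frac{e^{-\phi_i(x_i, t)}}{\int dx_i'\, e^{-\phi_i(x_i', t)}} \tag{20}$$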

The next subsection includes a convergence analysis in- 
volving residual errors, but for a different variant of 
Brouwer from the ones considered so far. 


E. Other variants of Brouwer updating 

Note that data-aging can be viewed as moving only
part-way from the current $\Phi_i$ to what it should be (i.e.,
to $\phi_i$). As an alternative, one can dispense with the
$\Phi_i$ and $\phi_i$ altogether, and instead step part-way from the
current $q$ to what it should be. This is partial movement
to the (bounded rational) best response mixed strategy.

Formally, this means replacing Eq. 12 so that the update
is not implicit, in how $\Phi_i(x_i, t)$ depends on the past
value of $q(t-1)$ (Eq. 14), but explicit:

$$q_i(x_i, t) = q_i(x_i, t-1) + \alpha\,[h_i(x_i, q_{(i)}(t-1)) - q_i(x_i, t-1)] \tag{21}$$

where $h_i(x_i, q_{(i)}(t))$ is the Boltzmann distribution of
what $q_i(x_i, t)$ would be, under ideal circumstances, and
we implicitly have small stepsize $\alpha$.
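
As a rough sketch of the explicit update in Eq. 21, the snippet below applies it in parallel to both players of a two-move team game. Everything concrete here is an assumption for illustration: the cost matrix `G`, the values of $\beta$ and $\alpha$, the starting distributions, and the helper names `boltzmann` and `partial_step`. The target $h_i$ is taken to be the Boltzmann distribution in the conditional expected cost given the other player's current mixed strategy.

```python
import numpy as np

def boltzmann(costs, beta):
    """Boltzmann distribution proportional to exp(-beta * costs)."""
    w = np.exp(-beta * (costs - costs.min()))
    return w / w.sum()

def partial_step(q1, q2, G, beta, alpha):
    """One parallel application of Eq. 21 to both players of a team game."""
    h1 = boltzmann(G @ q2, beta)      # E(G | x_1) under player 2's current q
    h2 = boltzmann(G.T @ q1, beta)    # E(G | x_2) under player 1's current q
    return q1 + alpha * (h1 - q1), q2 + alpha * (h2 - q2)

G = np.array([[0.0, 1.0], [1.0, 0.5]])   # illustrative shared cost G(x_1, x_2)
beta, alpha = 1.0, 0.1
q1 = np.array([0.9, 0.1])
q2 = np.array([0.2, 0.8])
for _ in range(500):
    q1, q2 = partial_step(q1, q2, G, beta, alpha)

# Residuals r_i = q_i - h_i at the final iterate.
r1 = q1 - boltzmann(G @ q2, beta)
r2 = q2 - boltzmann(G.T @ q1, beta)
```

For this particular small game the residuals shrink toward zero, so the iterate approaches a point where $q_i = h_i$ for both players. No claim is made that the same stepsize works for other costs or larger $\beta$.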

The only fixed point of this updating rule is where
$q_i = h_i \;\forall i$. So just like with continuum-limit parallel Brouwer,
we have the correct equilibrium. To investigate how fast
the update rule of Eq. 21 arrives at that equilibrium,
write its error at time $t$ as the residual

$$\begin{aligned}
r_i(x_i, t) &\equiv q_i(x_i, t) - h_i(x_i, q_{(i)}(t)) \\
&= q_i(x_i, t-1)[1-\alpha] + \alpha h_i(x_i, q_{(i)}(t-1)) - h_i(x_i, q_{(i)}(t)) \\
&= q_i(x_i, t-1)[1-\alpha] + \alpha h_i(x_i, q_{(i)}(t-1)) \\
&\quad - h_i\big[x_i,\; q_{(i)}(t-1) + \alpha[h_{(i)}(q(t-1)) - q_{(i)}(t-1)]\big]
\end{aligned} \tag{22}$$

where we have assumed that all players other than $i$
are updating themselves in the same way that $i$ does (i.e., via
Eq. 21), and $h_{(i)}(q(t-1))$ means the vector of the values
of all $h_{j \neq i}(x_j)$ evaluated for $q(t-1)$.

With obvious notation, rewrite Eq. 22 as

$$r_i(x_i, t) = q_i(x_i, t-1)[1-\alpha] + \alpha h_i(x_i, q_{(i)}(t-1)) - h_i[x_i,\; q_{(i)}(t-1) - \alpha r_{(i)}(t-1)] \tag{23}$$

Now use the fact that $\alpha$ is small to expand the last $h_i$
term on the right-hand side to first order in its second
(vector-valued) argument, getting the final result

$$r_i(x_i, t) \approx r_i(x_i, t-1)[1-\alpha] + \alpha\,\nabla h_i \cdot r_{(i)}(t-1) \tag{24}$$

where the gradient of $h_i$ is with respect to the vector
components of its second argument. Accordingly, if $r_i(x_i)$
starts much larger than the other residuals, it will be
pushed down to their values. Conversely, if it starts much
smaller than them, it will rise.

There are other ways one can reduce a stochastic game 
to a deterministic continuum-time process besides those 
considered here. In particular, this can be done in closed 
form for fictitious play games and some simple variants
of it [16,34]. 


V. STATISTICALLY COUPLING THE PLAYERS 

A. The semicoordinate system of a game 

Consider a multi-stage game like chess, with the stages
(i.e., the instants at which one of the players makes a
move) delineated by $t$. Now strategies are what are set
by the players before play starts. So in such a multi-stage
game the strategy of player $i$, $x_i$, must be the set of
$t$-indexed maps taking what that player has observed in
the stages $t' < t$ into its move at stage $t$. Formally, this
set of maps is called player $i$'s normal form strategy.

The joint strategy of the two players in chess sets their 
joint move-sequence, though in general the reverse need 
not be true. In addition, one can always find a joint 
strategy to result in any particular joint move-sequence. 
Now typically at any stage there is overlap in what the 
players have observed over the preceding stages. This 
means that even if the players’ strategies are statistically 
independent, their move sequences are statistically cou- 
pled. In such a situation, by parameterizing the space 
Z of joint-move-sequences z with joint-strategies x, we 
shift our focus from the coupled distribution P(z) to the 
decoupled product distribution, q(x). This is the advan- 
tage of casting multi-stage games in terms of normal form 
strategies. 

More generally, any onto mapping $\zeta : x \to z$, not
necessarily invertible, is called a semicoordinate system.
The identity mapping $z \to z$ is a trivial example of a
semicoordinate system. Another example is the mapping
from joint-strategies in a multi-stage game to joint
move-sequences. In other words, changing the representation
space of a multi-stage game from move-sequences $z$ to
strategies $x$ is a semicoordinate transformation of that game.

We can perform a semicoordinate transformation even
in a single-stage game. Say we restrict attention to
distributions over $X$ that are product distributions. Then
changing $\zeta(\cdot)$ from the identity map to some other
function means that the players' moves are no longer
independent. After the transformation their move choices
(the components of $z$) are statistically coupled, even
though we are considering a product distribution.

Formally, this is expressed via the standard rule for
transforming probabilities,

$$P_Z(z \in Z) = \zeta(P_X) = \int dx\, P_X(x)\,\delta(z - \zeta(x)), \tag{25}$$

where $P_X$ and $P_Z$ are the distributions across $X$ and $Z$,
respectively. To see what this rule means geometrically,
let $\mathcal{P}$ be the space of all distributions (product or
otherwise) over $Z$. Recall that $\mathcal{Q}$ is the space of all product
distributions over $X$, and let $\zeta(\mathcal{Q})$ be the image of $\mathcal{Q}$ in
$\mathcal{P}$. Then by changing $\zeta(\cdot)$, we change that image; different
choices of $\zeta(\cdot)$ will result in different manifolds $\zeta(\mathcal{Q})$.

As an example, say we have two players, with two
possible moves each. So $Z$ consists of the possible joint
moves, labeled (1,1), (1,2), (2,1) and (2,2). Have $X = Z$,
and choose $\zeta(1,1) = (1,1)$, $\zeta(1,2) = (2,2)$, $\zeta(2,1) = (2,1)$,
and $\zeta(2,2) = (1,2)$. Say that $q$ is given by
$q_1(x_1 = 1) = q_2(x_2 = 1) = 2/3$. Then the distribution
over joint-moves $z$ is $P_Z(1,1) = P_X(1,1) = 4/9$,
$P_Z(2,1) = P_Z(2,2) = 2/9$, $P_Z(1,2) = 1/9$. So
$P_Z(z) \neq P_Z(z_1) P_Z(z_2)$; the moves of the players are statistically
coupled, even though their strategies $x_i$ are independent.

Such coupling of the players' moves can be viewed as a
manifestation of sets of potential binding contracts. To
illustrate this, return to our two-player example. Each
possible value of a component $x_i$ determines a pair of
possible joint moves. For example, setting $x_1 = 1$ means
the possible joint moves are (1,1) and (2,2). Accordingly
such a value of $x_i$ can be viewed as a set of proffered
binding contracts. The value of the other components of $x$
determines which contract is accepted; it is the intersection
of the proffered contracts offered by all the components of
$x$ that determines what single contract is selected.
Continuing with our example, given that $x_1 = 1$, whether the
joint-move is (1,1) or (2,2) (the two options offered by
$x_1$) is determined by the value of $x_2$.
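
The two-player example can be checked directly with exact arithmetic. The sketch below is illustrative only; the dictionary names (`zeta`, `P_Z`, `contracts_x1`) are chosen here and are not notation from the text. It pushes the product distribution through the map as in Eq. 25, tests whether the resulting joint-move distribution factorizes, and lists the joint moves each value of $x_1$ proffers.

```python
from fractions import Fraction
from itertools import product

# Semicoordinate map of the two-player example: zeta[x] = z.
zeta = {(1, 1): (1, 1), (1, 2): (2, 2), (2, 1): (2, 1), (2, 2): (1, 2)}

q1 = {1: Fraction(2, 3), 2: Fraction(1, 3)}
q2 = {1: Fraction(2, 3), 2: Fraction(1, 3)}

# Eq. 25 for finite spaces: P_Z(z) is the total P_X mass mapped onto z.
P_Z = {z: Fraction(0) for z in product((1, 2), repeat=2)}
for (x1, x2), z in zeta.items():
    P_Z[z] += q1[x1] * q2[x2]

# Marginals over each player's move.
P_z1 = {a: sum(p for z, p in P_Z.items() if z[0] == a) for a in (1, 2)}
P_z2 = {b: sum(p for z, p in P_Z.items() if z[1] == b) for b in (1, 2)}
coupled = any(P_Z[(a, b)] != P_z1[a] * P_z2[b] for a, b in P_Z)

# Contracts proffered by each value of x_1: the joint moves it can lead to.
contracts_x1 = {a: {zeta[(a, b)] for b in (1, 2)} for a in (1, 2)}
```

Using `Fraction` keeps the probabilities exact, so the comparison of the joint distribution with the product of its marginals involves no rounding.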


B. Representational properties 

Binding contracts are a central component of cooperative
game theory. In this sense, semicoordinate transformations
can be viewed as a way to convert noncooperative
game theory into a form of cooperative game theory.
Indeed, any cooperative mixed strategy can be cast as a
noncooperative game mixed strategy followed by an
appropriate semicoordinate transformation. Formally, any
$P_Z$, no matter what the coupling among its components,
can be expressed as $\zeta(P_X)$ for some product distribution
$P_X$ and associated $\zeta(\cdot)$ [50].

Less trivially, given any model class of distributions
$\{P_Z\}$, there is an $X$ and associated $\zeta(\cdot)$ such that $\{P_Z\}$
is identical to $\zeta(\mathcal{Q}_X)$. Formally this is expressed in a
result concerning Bayes nets. For simplicity, restrict
attention to finite $Z$. Order the components of $Z$ from 1
to $N$. For each index $i \in \{1, 2, \ldots, N\}$, have the parent
function $\mathcal{P}(i, z)$ fix a subset of the components of $z$ with
index greater than $i$, returning the value of those
components for the $z$ in its second argument if that subset of
components is non-empty. So for example, with $N > 5$,
we could have $\mathcal{P}(1, z) = (z_2, z_5)$. Another possibility is
that $\mathcal{P}(1, z)$ is the empty set, independent of $z$.

Let $\mathcal{A}(\mathcal{P})$ be the set of all probability distributions $P_Z$
that obey the conditional dependencies implied by $\mathcal{P}$:
$\forall P_Z \in \mathcal{A}(\mathcal{P}),\ z \in Z$,

$$P_Z(z) = \prod_{i=1}^{N} P_Z(z_i \mid \mathcal{P}(i, z)). \tag{26}$$

(By definition, if $\mathcal{P}(i, z)$ is empty, $P_Z(z_i \mid \mathcal{P}(i, z))$ is
just the $i$'th marginal of $P_Z$, $P_Z(z_i)$.) Note that any
distribution $P_Z$ is a member of $\mathcal{A}(\mathcal{P})$ for some $\mathcal{P}$; in
the worst case, just choose the exhaustive parent function
$\mathcal{P}(i, z) = \{z_j : j > i\}$.

For any choice of $\mathcal{P}$ there is an associated set of
distributions $\zeta(\mathcal{Q}_X)$ that equals $\mathcal{A}(\mathcal{P})$ exactly:

Theorem: Define the components of $X$ using multiple
indices: For all $i \in \{1, 2, \ldots, N\}$ and possible associated
values (as one varies over $z \in Z$) of the vector $\mathcal{P}(i, z)$,
there is a separate component of $x$, $x_{i;\mathcal{P}(i,z)}$. This
component can take on any of the values that $z_i$ can. Define
$\zeta(\cdot)$ recursively, starting at $i = N$ and working to lower
$i$, by the following rule: $\forall i \in \{1, 2, \ldots, N\}$,

$$[\zeta(x)]_i = x_{i;\mathcal{P}(i,\zeta(x))}.$$

Then $\mathcal{A}(\mathcal{P}) = \zeta(\mathcal{Q}_X)$.

Proof: First note that by definition of parent functions,
and because we are iteratively working down from
higher $i$'s to lower ones, $\zeta(x)$ is properly defined. Next
plug that definition into Eq. 25. For any particular $x$ and
associated $z = \zeta(x)$, those components of $x$ that do not
“match” $z$ by having their second index equal $\mathcal{P}(i, z)$ get
integrated out. After this the integral reduces to

$$P_Z(z) = \prod_{i=1}^{N} P_X([x_{i;\mathcal{P}(i,z)}] = z_i),$$

i.e., is exactly of the form stipulated in Eq. 26.
Accordingly, for any fixed $x$ and associated $z = \zeta(x)$, ranging
over the set of all values between 0 and 1 for each of
the distributions $P_X([x_{i;\mathcal{P}(i,z)}] = z_i)$ will result in
ranging over all values for the distribution $P_Z(z)$ that are of
the form stipulated in Eq. 26. This must be true for
all $x$. Accordingly, $\zeta(\mathcal{Q}_X) \subseteq \mathcal{A}(\mathcal{P})$. The proof that
$\mathcal{A}(\mathcal{P}) \subseteq \zeta(\mathcal{Q}_X)$ goes similarly: For any given $P_Z$ and
$z$, simply set $P_X([x_{i;\mathcal{P}(i,z)}] = z_i)$ equal to
$P_Z(z_i \mid \mathcal{P}(i, z))$ for all the independent
components $x_{i;\mathcal{P}(i,z)}$ of $x$ and evaluate the integral
in Eq. 25. QED.

Intuitively, each component of $x$ in the theorem is the
conditional distribution $P_Z(z_i \mid \mathcal{P}(i, z))$ for some
particular instance of the vector $\mathcal{P}(i, z)$. The theorem means that
in principle we never need consider coupled distributions.
It suffices to restrict attention to product distributions,
so long as we use an appropriate semicoordinate system.
In particular, mixture models over $Z$ can be represented
this way.
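
A small instance of this construction may make it concrete. The sketch below assumes $N = 2$ binary components, with the parent of component 1 being $z_2$ and component 2 having no parents; the target distribution `P_Z` and all variable names are illustrative choices, not taken from the text. The space $X$ then has three components: one for $z_2$, and one for $z_1$ for each possible value of its parent.

```python
import numpy as np
from itertools import product

# A coupled target distribution P_Z[z1, z2] over two binary moves.
P_Z = np.array([[0.4, 0.1],
                [0.1, 0.4]])

# Product distribution over x = (x_2, x_{1;z2=0}, x_{1;z2=1}).
q_2 = P_Z.sum(axis=0)                              # marginal of z_2
q_1_given = [P_Z[:, b] / q_2[b] for b in (0, 1)]   # P_Z(z_1 | z_2 = b)

def zeta(x):
    """Recursive map: z_2 = x_2, then z_1 = x_{1; z_2}."""
    x_2, x_1_if0, x_1_if1 = x
    z_2 = x_2
    z_1 = (x_1_if0, x_1_if1)[z_2]
    return z_1, z_2

# Push the product distribution through zeta (finite form of Eq. 25).
image = np.zeros((2, 2))
for x in product((0, 1), repeat=3):
    x_2, a0, a1 = x
    image[zeta(x)] += q_2[x_2] * q_1_given[0][a0] * q_1_given[1][a1]
```

The component of $x$ whose second index does not match the realized $z_2$ is summed out, so `image` reproduces the coupled `P_Z` even though the distribution over $x$ is a product.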


C. Maxent Lagrangians over X rather than Z 

While the distribution over $X$ uniquely sets the
distribution over $Z$, the reverse is not true. However so long as
our Lagrangian directly concerns the distribution over $X$
rather than the distribution over $Z$, by minimizing that
Lagrangian we set a distribution over $Z$. In this way
we can minimize a Lagrangian involving product
distributions, even though the associated distribution in the
ultimate space of interest is not a product distribution.

The Lagrangian we choose over $X$ should depend on
our prior information, as usual. If we want that
Lagrangian to include an expected value over $Z$ (e.g., of
a cost function), we can directly incorporate that
expectation value into the Lagrangian over $X$, since
expected values in $X$ and $Z$ are identical:
$\int dz\, P_Z(z) A(z) = \int dx\, P_X(x) A(\zeta(x))$ for any function $A(z)$. (Indeed, this
is the standard justification of the rule for transforming
probabilities, Eq. 25.)

However other functionals of probability distributions
can differ between the two spaces. This is especially
common when $\zeta(\cdot)$ is not invertible, so $X$ is larger than $Z$.
In particular, while the expected cost term is the same
in the $X$ and $Z$ maxent Lagrangians, this is not true of
the two entropy terms in general; typically the entropy
of a $q \in \mathcal{Q}$ will differ from that of its image, $\zeta(q) \in \zeta(\mathcal{Q})$,
in such a case.

More concretely, the fully formal definition of entropy
includes a prior probability $\mu$: $S_X \equiv -\int dx\, p(x) \ln\left(\frac{p(x)}{\mu(x)}\right)$,
and similarly for $S_Z$. So long as $\mu(x)$ and $\mu(z)$ are related
by the normal laws for probability transformations, as
are $p(x)$ and $p(z)$, then if the cardinalities of $X$ and $Z$
are the same, $S_Z = S_X$ [51]. When the cardinalities of
the spaces differ though (e.g., when $X$ and $Z$ are both
finite but with differing numbers of elements), this need
no longer be the case. The following result bounds how
much the entropies can differ in such a situation:


Theorem: For all $z \in Z$, take $\mu(x)$ to be uniform over
all $x$ such that $\zeta(x) = z$. Then for any distribution $p(x)$
and its image $p(z)$,

$$-\int dz\, p(z) \ln(K(z)) \le S_X - S_Z \le 0,$$

where $K(z) \equiv \int dx\, \delta(z - \zeta(x))$. (Note that for finite $X$
and $Z$, $K(z) \ge 1$, and counts the number of $x$ with the
same image $z$.) If we ignore the $\mu$ terms in the definition
of entropy, then instead we have

$$0 \le S_X - S_Z \le \int dz\, p(z) \ln(K(z)).$$

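Proof: Write

$$\begin{aligned}
S_X &= -\int dz \int dx\, \delta(z - \zeta(x))\, p(x) \ln\left[\frac{p(x)}{\mu(x)}\right] \\
&= -\int dz \int dx\, \delta(z - \zeta(x))\, p(x) \ln\left[d(z)\, \frac{p(x)}{\mu(x)\, d(z)}\right] \\
&= -\int dz\, p(z) \ln[d(z)] - \int dz \int dx\, \delta(z - \zeta(x))\, p(x) \ln\left[\frac{p(x)}{\mu(x)\, d(z)}\right],
\end{aligned}$$

where $d(z) \equiv \int dx\, \delta(z - \zeta(x))\, \frac{p(x)}{\mu(x)}$. Define $\mu_z$ to be the
common value of all $\mu(x)$ such that $\zeta(x) = z$. So $\mu(z) = \mu_z K(z)$
and $p(z) = \mu_z d(z)$. Accordingly, expand our
expression as

$$\begin{aligned}
S_X &= -\int dz\, p(z) \ln\left[\frac{p(z)}{\mu(z)}\right] - \int dz\, p(z) \ln[K(z)] - \int dz \int dx\, \delta(z - \zeta(x))\, p(x) \ln\left[\frac{p(x)}{p(z)}\right] \\
&= S_Z - \int dz\, p(z) \ln[K(z)] + \int dz\, p(z) \left(-\int dx\, \delta(z - \zeta(x))\, \frac{p(x)}{p(z)} \ln\left[\frac{p(x)}{p(z)}\right]\right).
\end{aligned}$$

The $x$-integral on the right-hand side of the last equation
is just the entropy of the normalized distribution
defined over those $x$ such that $\zeta(x) = z$. Its maximum and
minimum are $\ln[K(z)]$ and 0, respectively. This proves
the first claim. The second claim, where we “ignore the
$\mu$ terms”, is proven similarly. QED.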
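
Both pairs of bounds can be checked numerically on a small finite example. The sketch below is an illustration under assumptions made here: a six-point $X$ mapped onto a three-point $Z$ with preimage sizes $K = (3, 2, 1)$, a randomly drawn $p(x)$, and a prior $\mu$ uniform over all of $X$ (hence uniform within each preimage). The function name `entropy` and the variable names are hypothetical.

```python
import numpy as np

zeta = np.array([0, 0, 0, 1, 1, 2])         # x -> z, so K = (3, 2, 1)
K = np.bincount(zeta)
rng = np.random.default_rng(1)
p_x = rng.dirichlet(np.ones(zeta.size))     # arbitrary distribution over X
p_z = np.bincount(zeta, weights=p_x)        # its image under zeta

mu_x = np.full(zeta.size, 1.0 / zeta.size)  # uniform prior over X
mu_z = np.bincount(zeta, weights=mu_x)      # transformed prior, K(z)/|X|

def entropy(p, mu=None):
    """-sum p ln(p/mu); mu=None drops the prior term."""
    mu = np.ones_like(p) if mu is None else mu
    return float(-np.sum(p * np.log(p / mu)))

bound = float(np.sum(p_z * np.log(K)))      # sum_z p(z) ln K(z)
diff_with_prior = entropy(p_x, mu_x) - entropy(p_z, mu_z)
diff_no_prior = entropy(p_x) - entropy(p_z)
```

With the prior included the difference lies in $[-\sum_z p(z)\ln K(z),\, 0]$; without it the difference lies in $[0,\, \sum_z p(z)\ln K(z)]$. In this finite setting the two differences are offset by exactly the bound.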

In such cases where the cardinalities of $X$ and $Z$ differ,
we have to be careful about which space we use to
formulate our Lagrangian. If we use the transformation $\zeta(\cdot)$ as
a tool to allow us to analyze bargaining games with binding
contracts, then the direct space of interest is actually
the $x$'s (that is the place in which the players make their
bargaining moves). In such cases it makes sense to apply
all the analysis of the preceding sections exactly as it is
written, concerning Lagrangians and distributions over
$x$ rather than $z$ (so long as we redefine cost functions
to implicitly pre-apply the mapping $\zeta(\cdot)$ to their
arguments). However if we instead use $\zeta(\cdot)$ simply as a way
of establishing statistical dependencies among the moves
of the players, it may make sense to include the entropy
correction factor in our $x$-space Lagrangian.

An important special case is where the following three
conditions are met: Each point $z$ is the image under
$\zeta(\cdot)$ of the same number of points in $x$-space, $n$; $\mu(x)$
is uniform (and therefore so is $\mu(z)$); and the Lagrangian
in $x$-space, $\mathscr{L}_x$, is a sum of expected costs and the
entropy. In this situation, consider a $z$-space Lagrangian,
$\mathscr{L}_z$, whose functional dependence on $P_z$, the distribution
over $z$'s, is identical to the dependence of $\mathscr{L}_x$ on $P_x$,
except that the entropy term is divided by $n$ [52]. Now
the minimizer $P^*(x)$ of $\mathscr{L}_x$ is a Boltzmann distribution
in values of the cost function(s). Accordingly, for any
$z$, $P^*(x)$ is uniform across all $n$ points $x \in \zeta^{-1}(z)$ (all
such $x$ have the same cost value(s)). This in turn means
that 5(C(Hi)) = nS(P z ). So our two Lagrangians give 
the same solution, i.e., the “correction factor” for the 
entropy term is just multiplication by n. 

D. Semicoordinate transformations in team games 

Now consider situations in which one wishes to find
the global minimum of the Lagrangian for a team game.
To illustrate the generality of the arguments, situations
where one has to use Monte Carlo estimates of
conditional expectation values to descend the shared
Lagrangian (rather than evaluate them in closed form) will be
considered.

Say we are currently at a local minimum $q \in \mathcal{Q}$ of
the Lagrangian $\mathscr{L}$ of the team game. Usually we can break out of that
minimum by raising $\beta$ and then resuming the updating;
typically changing $\beta$ changes $\mathscr{L}$ so that the Lagrange
gaps are nonzero. So if we want to anneal $\beta$ anyway
(e.g., to find a minimum of the shared cost function $G$),
it makes sense to do so to break out of any local minima.

There are many other ways to break out of local min- 
ima without changing the Lagrangian (as we would if we 
changed $\beta$, for example) [31]. Here we show how to use
semicoordinate transformations to do this. As explicated 
below, they also provide a general way to lower the value 
of the Lagrangian, whether or not one has local minimum 
problems. 

Say our original semicoordinate system is $\zeta^1(\cdot)$. Switch
to a different semicoordinate system $\zeta^2(\cdot)$ for $Z$ and
consider product distributions over the associated space $X^2$.
Geometrically, the semicoordinate transformation means
we change to a new submanifold $\zeta^2(\mathcal{Q}) \subset \mathcal{P}$ without
changing the underlying mapping from $p(z)$ to $\mathscr{L}_Z(p)$.

As a simple example, say $\zeta^2$ is identical to $\zeta^1$ except
that it joins two components of $x$ into an aggregate
semicoordinate. Since after that change we can have
statistical dependencies between those two components, the
product distributions over $X^2$, $\zeta^2(\mathcal{Q}_{X^2})$, map to a
superset of $\zeta^1(\mathcal{Q}_{X^1})$. Typically the local minima of that
superset do not coincide with local minima of $\zeta^1(\mathcal{Q}_{X^1})$.
So this change to $X^2$ will indeed break out of the local
minimum, in general.

More care is needed when working with more
complicated semicoordinate transformations. Say before the
transformation we are at a point $p^* \in \zeta^1(\mathcal{Q}_{X^1})$. Then in
general $p^*$ will not be in the new manifold $\zeta^2(\mathcal{Q}_{X^2})$, i.e.,
$p^*$ will not correspond to a product distribution in our
new semicoordinate system. (This reflects the fact that
semicoordinate transformations couple the players.)
Accordingly, we must change from $p^*$ to a new distribution
when we change the semicoordinate system.

To illustrate this, say that the semicoordinate
transformation is bijective. Formally, this means that
$X^2 = X^1 \equiv X$ and $\zeta^2(x) = \zeta^1(\xi(x))$ for a bijective $\xi(\cdot)$. Have
$\xi(\cdot)$, the mapping from $X^2$ to $X^1$, be the identity map
for all but a few of the $M$ total components of $X$,
indicated as indices $1 \to n$. Intuitively, for any fixed
$x^2_{n+1 \to M} = x^1_{n+1 \to M}$, the effect of the semicoordinate
transformation to $\zeta^2(\cdot)$ from $\zeta^1(\cdot)$ is merely to “shuffle”
the associated mapping taking semicoordinates $1 \to n$ to
$Z$, as specified by $\xi(\cdot)$. Moreover, since $\xi(\cdot)$ is a bijection,
the maxent Lagrangians over $X^1$ and $X^2$ are identical:
$\mathscr{L}_{X^1}(\xi(p^{X^2})) = \mathscr{L}_{X^2}(p^{X^2})$.

Now say we set $q^2_{n+1 \to M} = q^*_{n+1 \to M}$. This means we
can estimate the expectations of $G$ conditioned on
possible $x^2_{1 \to n}$ from the Monte Carlo samples conditioned
on $\xi(x^2_{1 \to n})$. In particular, for any $\xi(\cdot)$ we can estimate
$E(G)$ as $\int dx^2_{1 \to n}\, q^2_{1 \to n}(x^2_{1 \to n})\, E(G \mid \xi(x^2_{1 \to n}))$ in the usual
way. Now entropy is the sum of the entropy of
semicoordinates $n+1 \to M$ plus that of semicoordinates $1 \to n$.
So for any choice of $\xi(\cdot)$ and $q^2_{1 \to n}$, we can approximate
$\mathscr{L}_{X^1} = \mathscr{L}_{X^2}$ as (our associated estimate of) $E(G)$
minus the entropy of $q^2_{1 \to n}$, minus a constant unaffected by
choice of $\xi(\cdot)$.

So for finite and small enough cardinality of the
subspace $X_{1 \to n}$ we can use our estimates $E(G \mid \xi(x^2_{1 \to n}))$
to search for the “shuffling” $\xi(\cdot)$ and distribution $q^2_{1 \to n}$
that minimizes $\mathscr{L}_X$ [53]. In particular, say we have
descended $\mathscr{L}_{X^1}$ to a distribution $q^{X^1}(x) = q^*(x)$. Then
we can set $q^{X^2} = q^*$, and consider a set of “shuffling
$\xi(\cdot)$”. Each such $\xi(\cdot)$ will result in a different
distribution $q^{X^1}(x) = q^{X^2}(\xi^{-1}(x)) = q^*(\xi^{-1}(x))$. While those
distributions will have the same entropy, typically they
will have different (estimates of) $E(G)$ and accordingly
different local minima of the Lagrangian.

Accordingly, searching across the $\xi(\cdot)$ can be used to
break out of a local minimum. However since $E(G)$
changes under such transformations even if we are not at
a local minimum, we can search across $\xi(\cdot)$ as a new
way (in addition to those discussed above) for lowering
the value of the Lagrangian. Indeed, there is always a
bijective semicoordinate transformation that reduces the
Lagrangian: simply choose $\xi(\cdot)$ to rearrange the $G(x)$ so
that $G(x) < G(x') \Leftrightarrow q(x) > q(x')$, which pairs the lowest
costs with the highest probabilities. In addition one can
search for that $\xi(\cdot)$ in a distributed fashion, where one
after the other each agent $i$ rearranges its semicoordinate to
shrink $E(G)$. Furthermore to search over semicoordinate
systems we don't need to take any additional samples of
$G$. (The existing samples can be used to estimate the
$E(G)$ for each new system.) So the search can be done
off-line.
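
The rearrangement argument can be sketched on a finite joint space. The snippet below is illustrative only: the costs `G`, the distribution `q`, the space size of 8, and the permutation array `xi` are assumptions made here. It builds the bijection that gives the most probable point the lowest cost, the next most probable point the next lowest cost, and so on, while leaving `q` itself (and hence its entropy) untouched.

```python
import numpy as np

rng = np.random.default_rng(2)
G = rng.uniform(size=8)              # illustrative costs G(x) on 8 joint moves
q = rng.dirichlet(np.ones(8))        # current distribution q(x), held fixed

# Choose the bijection xi so that the largest q(x) is paired with the
# smallest cost: G_shuffled[x] = G[xi[x]].
xi = np.empty(8, dtype=int)
xi[np.argsort(-q)] = np.argsort(G)
G_shuffled = G[xi]

expected_before = float(q @ G)
expected_after = float(q @ G_shuffled)
```

By the rearrangement inequality `expected_after` cannot exceed `expected_before`. Because only the assignment of costs to points changes, the entropy term of the Lagrangian is the same before and after.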

To determine the semicoordinate transformation we 
can consider other factors besides the change in the value 
of the Lagrangian that immediately arises under the 
transformation. We can also estimate the amount
that subsequent evolution under the new semicoordinate 
system will decrease the Lagrangian. We can estimate 
that subsequent drop in a number of ways: the sum of 
the Lagrangian gaps of all the agents, gradient of the 
Lagrangian in the new semicoordinate system, etc. 


E. Distributions over semicoordinate systems 

The straightforward way to implement these kinds of
schemes for finding a good semicoordinate system is via
exhaustive search, hill-climbing, simulated annealing, or
the like. Potentially it would be very useful to instead
find a new semicoordinate system using search techniques
designed for continuous spaces. When there are a finite
number of semicoordinate systems (i.e., finite $X$ and $Z$)
this would amount to using search techniques for
continuous spaces to optimize a function of a variable having a
finite number of values. However we now know how to do
that: use PD theory. In the current context, this means
placing a product probability distribution over a set of
variables parameterizing the semicoordinate system, and
then evolving the probability distribution.

More concretely, write 

$$\mathscr{L}(q) = \beta \sum_{\theta} \sum_{x} \prod_{i=1}^{N} q_i(x_i)\, P(\theta)\, G(\zeta(x, \theta)) + S(q) \tag{27}$$

where $\theta$ is a parameter on the semicoordinate system.
We can rewrite this using an additional semicoordinate 
transformation, as 

$$\mathscr{L}(q^*) = \beta \sum_{x^*} \prod_{i=1}^{N+1} q^*_i(x^*_i)\, G(\zeta(x^*)) + S(q^*) \tag{28}$$

where $x^*_i = x_i$ for all $i$ up to $N$, and $x^*_{N+1} = \theta$. (As usual,
depending on what space we cast our Lagrangian in, the
entropy can either have the argument of the entropy term
starred, as here, or not.)

Intuitively, this approach amounts to introducing a 
new coordinate/agent, whose “job” is to set the semi- 
coordinate system governing the mapping from the other 
agents to a z value. This provides an alternative to pe- 
riodically (e.g., at a local minim um) picking a set of al- 
ternative semicoordinate systems and estimating which 
gives the biggest drop in the overall Lagrangian. We 
can instead use Nearest Newton, Brouwer updating, or 
what have you, to continuously search for the optimal 
coordinate system as we also search for the optimal x. 
The tradeoff, of course, is that by introducing an extra 
coordinate/agent, we raise the noise level all the origi- 
nal semicoordinates experience. (This raises the issue of 
what best parameterization of ((.) to use, an issue not 
addressed here.) 
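
One way to picture the extra coordinate/agent is the following sketch, in which a third agent chooses between two candidate semicoordinate systems while the two original agents choose their semicoordinates. All specifics are assumptions for illustration: the cost matrix `G`, the two candidate maps indexed by `theta`, the values of $\beta$ and $\alpha$, and the use of the explicit partial-step update of Eq. 21 for all three agents. It is not presented as the procedure used in the paper.

```python
import numpy as np

def boltzmann(costs, beta):
    """Boltzmann distribution proportional to exp(-beta * costs)."""
    w = np.exp(-beta * (costs - costs.min()))
    return w / w.sum()

G = np.array([[0.0, 1.0], [1.0, 0.5]])    # illustrative cost on joint moves z
# Two candidate semicoordinate systems, indexed by theta:
# theta = 0 is the identity; theta = 1 swaps player 2's move when x_1 = 1.
C = np.empty((2, 2, 2))                   # C[x1, x2, theta] = G(zeta(x, theta))
C[:, :, 0] = G
C[:, :, 1] = G
C[1, :, 1] = G[1, ::-1]

beta, alpha = 1.0, 0.1
q = [np.full(2, 0.5) for _ in range(3)]   # q_1, q_2 and the theta agent
for _ in range(500):
    # Conditional expected cost for each agent given the others' current q.
    cond = [np.einsum('abc,b,c->a', C, q[1], q[2]),
            np.einsum('abc,a,c->b', C, q[0], q[2]),
            np.einsum('abc,a,b->c', C, q[0], q[1])]
    q = [qi + alpha * (boltzmann(ci, beta) - qi) for qi, ci in zip(q, cond)]

expected_cost = float(np.einsum('abc,a,b,c->', C, q[0], q[1], q[2]))
```

Here `q[2]` plays the role of the distribution over $\theta$, so the choice of semicoordinate system is evolved alongside the distributions over the original semicoordinates rather than being selected in a separate search step.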

VI. RELATED WORK AND EXTENSIONS 

The core of this paper is the maxent Lagrangian and
associated Boltzmann distribution solution. These have
been investigated for well over a century in statistical
physics. The use of the Boltzmann distribution over
possible moves also has a long history in the RL literature.
In all of this RL work though the Boltzmann distribution
is usually motivated either as an a priori reasonable
way to trade off exploration and exploitation, as part of
a Markov Chain Monte Carlo procedure, or by its
asymptotic convergence properties [30].

Independent of the work in [4], the maxent Lagrangian 
and/or the Boltzmann distribution has previously been 
suggested as a way to model human players [16, 34, 35]. 
Some of that work has explicitly noted the relation be- 
tween the Boltzmann distribution and statistical physics 
[36]. However the motivation of the maxent Lagrangian
and Boltzmann distribution in that work is ad hoc, based 
on particular simple models of human decision-making 
and/or of player interactions. There is no use of infor- 
mation theory to derive the maxent Lagrangian from first 
principles, as is done in PD theory. 

Some of the benefits of such a first principles approach
are presented in this paper. Others are reported in [4].
These include an explicit term in the analysis that, in
light of information theory, corresponds to cost of
computation. Other benefits are natural ways to accommodate
multiple cost functions per player. PD theory also
highlights the very close relationship between bounded
rational game theory and statistical physics. This
relationship allows many of the tools of statistical physics
to be applied to bounded rational games. For example,
by exploiting the grand canonical ensemble of statistical
physics, they allow one to analyze bounded rational
games with variable numbers of players, in essence a
bounded rational extension of evolutionary game theory
[4].

Finally, it’s important to note that PD theory has 
many applications beyond those considered in this pa- 
per. For example, see [8, 31, 37-40] for other work re- 
lating the maxent Lagrangian to distributed control and 
to distributed optimization. See [31] for algorithms for 
speeding up convergence to bounded rational equilibria. 
Some of those algorithms are related to simulated and 
deterministic annealing [41]. See also [42, 43] for work 
showing, respectively, how to use PD theory to improve 
Metropolis-Hastings sampling and how to extend it to 
continuous move spaces and time-extended strategies. 